INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost. This is beneficial to hotel guests, but it is less desirable and possibly revenue-diminishing for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a model that can predict in advance which bookings are going to be canceled, and help in formulating profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

General Observations

Observations:

Minor Modifications

Notes:

All booking IDs are unique and can't help with building an ML-based model.
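A minimal sketch of this check on a toy table. The column names `Booking_ID` and `lead_time` are assumptions, since the data dictionary is not reproduced here.

```python
import pandas as pd

# Toy stand-in for the bookings table; column names are assumed.
df = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00003"],
    "lead_time": [224, 5, 1],
})

# An identifier that is unique on every row carries no predictive signal,
# so it can be dropped before modeling.
id_is_unique = df["Booking_ID"].is_unique
if id_is_unique:
    df = df.drop(columns="Booking_ID")
```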

Notes:

Statistical Overview

Observations:

Numerical Variables:

Categorical variables:

Exploratory Data Analysis (EDA)

In the following EDA, we aim to address the questions asked below, but the EDA will be far more comprehensive than merely answering the leading questions.

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Univariate Analysis

No. of Adults

No. of Children

Notes:

The sudden jump from 3 to 9 in the number of children, and the fact that only three bookings have this many children, prompted us to treat these outliers by replacing their number of children with 3. This would particularly help with logistic-regression-based modeling, which can be more sensitive to outliers than decision trees.
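The capping described above can be sketched as follows. The series name `no_of_children` and the toy values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample: most bookings have 0-3 children, a few have 9 or 10.
children = pd.Series([0, 1, 2, 3, 9, 10, 10], name="no_of_children")

# Cap every value above 3 at 3; values of 3 or less are left untouched.
capped = children.clip(upper=3)
```

Using `clip` keeps the rows in the data rather than dropping them, which matters when the other columns of those bookings are still informative.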

No. of Reserved Weekend Nights

No. of Reserved Week Nights

Notes:

Observations:

Of the reservations with no booked nights, 13 come from the complimentary segment, for which this may be fine. The remaining 65 bookings, however, come from the online segment. We treat these entries by replacing their reserved week and weekend nights with their most probable values (modes).

Notes:

The mode was chosen over the median for imputation. With medians, the imputed rows would have had three total nights (two week nights and one weekend night), which seems too specific compared to two week nights only. Changing total nights from 0 to 3 also seemed too drastic.
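One possible way to carry out this imputation is sketched below on toy data. The column names and segment labels are assumptions, and so is the choice to compute the modes from online bookings that have at least one night.

```python
import pandas as pd

# Toy data; column names and segment labels are assumed.
df = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Online", "Online",
                            "Complementary", "Offline"],
    "no_of_weekend_nights": [0, 0, 1, 0, 0, 2],
    "no_of_week_nights":    [2, 2, 3, 0, 0, 1],
})

online = df["market_segment_type"] == "Online"
zero_nights = (df["no_of_weekend_nights"] + df["no_of_week_nights"]) == 0
target = online & zero_nights            # rows to impute

# Modes are taken from online bookings that do have nights (an assumption).
reference = df[online & ~zero_nights]
for col in ["no_of_weekend_nights", "no_of_week_nights"]:
    df.loc[target, col] = reference[col].mode().iloc[0]
```

Zero-night bookings from the complimentary segment are deliberately left as they are.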

Meal Plan

Parking Spaces

Arrival Year

Arrival Month

Notes:

As can be seen, only six months of data are available for 2017. A fairer analysis might therefore limit the arrival-month countplot to 2018.
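Restricting the monthly counts to 2018 could look like the sketch below. The column names `arrival_year` and `arrival_month` are assumed, and the rows are made up.

```python
import pandas as pd

# Toy data; column names are assumed.
df = pd.DataFrame({
    "arrival_year":  [2017, 2017, 2018, 2018, 2018, 2018],
    "arrival_month": [10, 12, 1, 10, 10, 12],
})

# Count bookings per month using 2018 rows only, ordered by month.
monthly_2018 = (
    df.loc[df["arrival_year"] == 2018, "arrival_month"]
      .value_counts()
      .sort_index()
)
```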

Type of Market Segment

No. of Special Requests

Repeated Guest

No. of Previous Cancelled Reservations

No. of Previous Not Cancelled Reservations

Notes:

Cancellation Status

Observations from the Univariate Analysis of Categorical Variables:

Average Price per Room

Notes:

Lead Time

Arrival Date

Observations from the Univariate Analysis of Numerical Variables:

Bivariate Analysis

The provided dataset comprises many columns, so the number of possible bivariate analyses is large. Therefore, we limit our bivariate analysis to the relationships between some of the most important parameters, while being guided by the leading questions.

Correlation Heatmap of Numerical Variables

Observations:

Price vs. Market Segment

Observations:

Price vs. Room Type

Observations:

Cancellation Status vs. Room Type

Observations:

Guests staying in room type 6 have the highest cancellation rate, while those staying in room type 7 are the least likely to cancel. Overall, however, there isn't a clear-cut relationship between cancellation rate and room type. For many room types the rate is around 1/3, which is close to the cancellation rate of the full data.

Cancellation Status vs. Market Segment

Observations:

Customers from the online segment have the highest cancellation percentage. As expected, nobody has refused a complimentary rate, and none of the bookings in this segment have been cancelled.
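Cancellation rates per category, as used in this and the neighboring comparisons, can be computed with a row-normalized cross-tabulation. The sketch below uses made-up rows and assumed column names and labels.

```python
import pandas as pd

# Toy data; column names and labels are assumed.
df = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Online",
                            "Offline", "Offline", "Complementary"],
    "booking_status": ["Canceled", "Canceled", "Not_Canceled",
                       "Canceled", "Not_Canceled", "Not_Canceled"],
})

# normalize="index" turns each row into shares that sum to 1 per segment.
rates = pd.crosstab(df["market_segment_type"], df["booking_status"],
                    normalize="index")
cancel_rate = rates["Canceled"].sort_values(ascending=False)
```

The same pattern applies to room type, special requests, or repeated-guest status by swapping the first argument.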

Cancellation Status vs. Special Requests

Observations:

Price vs. Special Requests

Observations:

Cancellation Status vs. Price

Observations:

Cancellation Status vs. Lead Time

Observations:

Non-single (Adult) Travelers

Observations:

Cancellation Status vs. Repeated Guest

Observations:

Although repeated guests constitute a small fraction of the data, their cancellation rate (1.7%) is much lower than that of the full data and of new guests. This is good news for the hotel chain, as it indicates that it has established strong loyalty among its frequent guests.

Number of Cancelled and Visiting Guests vs. Month

Observations:

Percentage of Cancelled Reservations vs. Month

Observations:

Price vs. Month

Observations:

Data Preprocessing

Outlier Check

Observations:

Feature Engineering

Here, we will drop some of the unhelpful/temporary variables, replace the entries of cancellation status with 0 and 1, and convert categorical variables with a natural (mathematical) order into numerical variables.

Notes:

Arrival date was dropped, as we showed that no day of the month stands out from the others.

Notes:

Here, December was set as the first month, so that the cold months of the year, which have lower cancellation rates, sit next to each other.
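The two encodings discussed in this section can be sketched as below. The column names, the status labels, and the exact numbering (December as 1, January as 2, and so on) are assumptions; the notebook's own mapping may differ.

```python
import pandas as pd

# Toy data; column names and labels are assumed.
df = pd.DataFrame({
    "arrival_month": [12, 1, 2, 7, 11],
    "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled",
                       "Canceled", "Not_Canceled"],
})

# Target: 1 for a cancelled booking, 0 otherwise.
df["booking_status"] = (df["booking_status"] == "Canceled").astype(int)

# Shift months so that December -> 1, January -> 2, ..., November -> 12.
df["arrival_month"] = df["arrival_month"] % 12 + 1
```

With this shift, December, January, and February take the consecutive values 1, 2, and 3, so a single split on the month can separate the winter block.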

Post-imputation EDA (brief)

Average Price per Room

Observations:

The only change here is the replacement of the original maximum (540) by the second largest number (375.5).

No. of Children

Observations:

The only change here is the substitution of implausibly large numbers of children (9 and 10) by the next largest value (3). This hasn't visibly changed the percentages, as only three entries had more than three children.

No. of Reserved Weekend Nights

Observations:

Since the weekend nights of online-segment bookings with zero total nights were imputed with 0 (the mode), there is no change here.

No. of Reserved Week Nights

Observations:

Since some of the zero values (those of online-segment bookings with zero total nights) have been imputed with the mode of online-segment week nights (2), the percentage of bookings with 0 week nights has dropped by 0.2%, and that of bookings with 2 week nights has increased by the same amount.

Logistic Regression Model

Auxiliary Functions

Here, some additional, in-house functions are introduced to help us with evaluating and improving the performance of classification models (logistic-regression-based or decision-tree-based) in this project.

Preparing Data and Building Initial Model

Notes:

Examining the Performance of the Original Model

Observations:

Checking Multicollinearity

Observations:

No set of predictors shows multicollinearity, and the VIF values for the intercept and dummy variables remain finite and well defined. Therefore, no multicollinearity problem is reported.
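For reference, the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes it with NumPy on synthetic data. The helper `vif` is hypothetical and is not the function used in the notebook.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF for each column of X (predictors only, no constant column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
```

In this synthetic case the VIFs of `a` and `c` are large while that of `b` stays close to 1. Common rules of thumb treat values above 5 or 10 as a sign of problematic collinearity.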

Removing Statistically Insignificant Predictors

Model Performance After Removal of Statistically Insignificant Variables

Observations:

Removing features with p-value >= 0.05 has slightly reduced the recall score of the model on both training and validation data.

Finding Optimal Thresholds

Here, we use ROC and precision-recall curves to determine thresholds that will maximize the difference between TPR (recall) and FPR, and will make recall and precision scores equal, respectively.
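Both thresholds can be read off scikit-learn's curve functions, as in the sketch below. The labels and predicted probabilities are synthetic placeholders, not output of the notebook's model.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

# Synthetic labels and scores standing in for model output (assumption).
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=400)
y_prob = np.clip(0.3 * y_true + rng.uniform(0, 0.7, size=400), 0, 1)

# ROC-based threshold: maximize TPR - FPR.
fpr, tpr, roc_thr = roc_curve(y_true, y_prob)
thr_roc = roc_thr[np.argmax(tpr - fpr)]

# Precision-recall threshold: the point where precision and recall are closest.
precision, recall, pr_thr = precision_recall_curve(y_true, y_prob)
# precision and recall carry one extra trailing element with no threshold.
gap = np.abs(precision[:-1] - recall[:-1])
thr_pr = pr_thr[np.argmin(gap)]
```

The ROC-based threshold tends to favor recall, while the precision-recall threshold balances the two kinds of error, so the two values generally differ.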

Observations:

Observations:

Effect of Threshold on the Performance of Logistic Regression

Here, we evaluate the performance of the final logistic regression model as the threshold changes from 0.5 to optimal values.

Observations:

Observations:

Interpretation of Model's Parameters

Suppose b is the coefficient of a certain predictor in the model. Then exp(b) is the odds ratio for a one-unit increase in that predictor, and (exp(b) - 1)*100 is the percentage change in the odds of the target variable (here, cancellation), if all other variables remain constant.
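A short worked example with a hypothetical coefficient and baseline probability. Note that exp(b) scales the odds rather than the probability, so the resulting change in probability depends on the starting point.

```python
import math

b = 0.05                      # hypothetical coefficient of one predictor
odds_ratio = math.exp(b)      # multiplicative change in odds per unit increase
pct_change_in_odds = (odds_ratio - 1) * 100

# Effect on probability for an assumed baseline cancellation probability.
p0 = 0.30
odds0 = p0 / (1 - p0)
odds1 = odds0 * odds_ratio    # odds after a one-unit increase
p1 = odds1 / (1 + odds1)      # converted back to a probability
```

Here the odds rise by about 5.1%, while the probability moves from 0.30 to a slightly higher value.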

Observations:

Decision Tree Model

Preparing Data and Building the Initial Model

Note that since random_state is the same for logistic regression and the decision tree, the same rows will be picked for the training, validation, and testing sets in both approaches.
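The reproducibility of the split can be illustrated as follows. The 60/20/20 proportions, the seed value, and the helper `three_way_split` are assumptions for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced binary target.
X = np.arange(100).reshape(-1, 1)
y = np.array([0, 0, 1] * 33 + [0])

def three_way_split(X, y, seed=1):
    """Hypothetical helper: train/validation/test split with a fixed seed."""
    X_tmp, X_test, y_tmp, _ = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_train, X_val, _, _ = train_test_split(
        X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=seed)
    return X_train, X_val, X_test

# Calling the helper twice with the same seed returns identical partitions.
first = three_way_split(X, y)
second = three_way_split(X, y)
same_rows = all(np.array_equal(a, b) for a, b in zip(first, second))
```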

Observations:

Pre-Pruning the Full Tree: Maximizing Recall

We adopt two different approaches to pre-pruning. First we attempt to maximize the recall score, and then we develop a pre-pruned tree that maximizes the F1 score. This is aligned with the two objectives the hotel chain might have in mind.
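One common way to pre-prune toward a chosen metric is a cross-validated grid search over the tree's size parameters, sketched below on synthetic data. The parameter grid and the use of GridSearchCV are assumptions; the notebook may instead tune against its validation set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bookings table.
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.67, 0.33], random_state=1)

# Hypothetical grid of pre-pruning parameters.
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [5, 20],
    "class_weight": [None, "balanced"],
}

# Run the same search twice, once per objective.
best = {}
for metric in ("recall", "f1"):
    search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                          param_grid, scoring=metric, cv=5)
    search.fit(X, y)
    best[metric] = search.best_params_
```

Only the `scoring` argument changes between the two runs, which makes it easy to compare the tree selected for recall with the one selected for F1.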

Observations:

Pre-Pruning the Full Tree: Maximizing F1

Notes:

Note that, here, we have capped the maximum depth of the tree at 8, otherwise it would grow even beyond its current complex structure and become far more intricate.

Observations:

Post-Pruning the Full Tree

Notes:

Here, class_weight was set equal to 'balanced', as we learned from pre-pruning that it yields more accurate models.
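A minimal sketch of cost-complexity post-pruning with class_weight='balanced', on synthetic data. Selecting the alpha by validation F1 is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bookings table.
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.67, 0.33], random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Candidate alphas come from the pruning path of the full tree.
base = DecisionTreeClassifier(class_weight="balanced", random_state=1)
path = base.cost_complexity_pruning_path(X_train, y_train)
# Drop the last alpha (it prunes down to the root) and guard against
# tiny negative values caused by floating-point error.
alphas = np.clip(path.ccp_alphas[:-1], 0, None)

best_alpha, best_score, best_tree = None, -1.0, None
for alpha in alphas:
    tree = DecisionTreeClassifier(class_weight="balanced", ccp_alpha=alpha,
                                  random_state=1).fit(X_train, y_train)
    score = f1_score(y_val, tree.predict(X_val))
    if score > best_score:
        best_alpha, best_score, best_tree = alpha, score, tree
```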

Observations:

Notes:

Observations:

Comparison of All Decision Trees

Observations:

Comparison of the Optimized Logistic Regression and Decision Tree Models

Here, the performance of all optimized models developed thus far (logistic regression or decision tree) is compared on the testing set, which has been left untouched up to this step. Since the performance of the raw (baseline) models (the logistic regression model without the ROC or precision-recall thresholds, and the full tree model) is inferior, we won't consider them here.

Observations:

Actionable Insights and Recommendations

Here, we do not intend to present a lengthy summary of all findings. Instead, we aim to touch on the major ones and draw actionable conclusions and recommendations from them.

Key observations from EDA and classification models:

Business recommendations:

Recommendations for future ML-based modeling: